Skip to content

Conversation

@dwalluck
Copy link
Contributor

Remove the scheme field and replace it with a SCHEME constant as its impossible to create a PackageURL with a different scheme. This should not change the public API.

It was previously the case that parsing a purl could produce different URI representations internally based on the number of slashes present after the scheme colon in the input. For example, a purl with the canonical form pkg:generic/name when input starting with pkg:, pkg:/, and pkg:// will produce three different URI representations. By normalizing the input to contain a single '/', as in pkg:/generic/name, we obtain a valid hierarchical URI with all five standard components [scheme:][//authority][path][?query][#fragment] properly parsed. For the URI to be a valid purl, scheme: must be "pkg:", authority must be null, and ?query and #fragment are optional. Then, we only have to parse the path containing the type/namespace/name@version portion of the purl.

Remove the `scheme` field and replace it with a `SCHEME` constant as
its impossible to create a `PackageURL` with a different `scheme`.
This should not change the public API.

It was previously the case that parsing a purl could produce different
`URI` representations internally based on the number of slashes present
after the scheme colon in the input. For example, a purl with the
canonical form `pkg:generic/name` when input starting with `pkg:`,
`pkg:/`, and `pkg://` will produce three different `URI`
representations. By normalizing the input to contain a single `'/'`, as
in `pkg:/generic/name`, we obtain a valid hierarchical `URI` with all
five standard components `[scheme:][//authority][path][?query][#fragment]`
properly parsed. For the `URI` to be a valid purl, `scheme:` must be
`"pkg:"`, `authority` must be `null`, and `?query` and `#fragment` are
optional. Then, we only have to parse the `path` containing the
`type/namespace/name@version` portion of the purl.
Copy link
Contributor

@ppkarwasz ppkarwasz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2 cents:

if (index >= 0) {
this.qualifiers = parseQualifiers(remainder.substring(index + 1));
remainder.setLength(index);
final URI uri = new URI(String.join("/", SCHEME_PART, purl.substring(start)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice trick! 💯

This way we get the query parsing for free.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, as I tried to explain in the commit message, the reason that the parsing was not working in the original code is that the canonical purl really should have a single '/', as in "pkg:/", not "pkg:". According to the javadocs for java.net.URI, if it has no '/', then it's not a hierarchical URI, and will not be parsed into its components.

Also according to RFC 2396, having two slashes as in "pkg://" is not allowed unless you have an authority component. In other words, two slashes will make it treat the purl type as the host portion of the URI in this case instead of part if the path.

So the purl spec is not following the RFC is a couple ways. In the past, parsers were more accepting of invalid input, but I am not sure that this is a good idea to keep allowing this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PURL spec does follow RFC 2396 and 3986 as far as I can tell. It just chooses to use the opaque_part syntax from RFC 2396.

I agree that using the hierarchical pkg:/ syntax would be easier for implementations, especially in languages that don't have an implementation of RFC 3986. Maybe it is worth to open an issue in the spec repo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, the spec specifically implies that it is a hierarchical:

The purl components are mapped to these URL components:

purl scheme: this is a URL scheme with a constant value: pkg
purl type, namespace, name and version components: these are collectively mapped to a URL path
purl qualifiers: this maps to a URL query
purl subpath: this is a URL fragment

But, none of this "maps" as written.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really expect that they will change any of this, but they really ought to stop referencing wikipedia and reference one or both of these RFCs to clarify these things.

RFC 2396 says:

absoluteURI = scheme ":" ( hier_part | opaque_part )
hier_part = ( net_path | abs_path ) [ "?" query ]
net_path = "//" authority [ abs_path ]
abs_path = "/" path_segments

which means: no empty authority and hier_part has to be an abs_path.

So, I am not sure why you say RFC 2396 and 3986 differ as to this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RFC 2396 says:

absoluteURI = scheme ":" ( hier_part | opaque_part )
hier_part = ( net_path | abs_path ) [ "?" query ]
net_path = "//" authority [ abs_path ]
abs_path = "/" path_segments

which means: no empty authority and hier_part has to be an abs_path.

Currently the PURLs don't have a solid / after the schema, so it is of the form scheme ":" opaque_part.

So, I am not sure why you say RFC 2396 and 3986 differ as to this.

RFC 3986 improved the grammar, by replacing opaque_part with path-rootless [ "?" query ], so a compliant RFC 3986 parser would split the opaque part into a path and a query. And RFC 2396 parser does not do it.

Copy link
Collaborator

@jeremylong jeremylong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jeremylong jeremylong merged commit af34fc4 into package-url:master Mar 9, 2025
2 checks passed
@dwalluck dwalluck deleted the fix-uri-creation branch March 9, 2025 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants